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AFOSR-TR-  8  0-0863 

VECTORIZATION  OF  THE  BEAM-WARMING  SCHEME  ON  THE  CRAY-1  COMPUTER 
’•  action 

In  recent  years  substantial  progress  has  been  made  in  the  area  ' 

of  numerical  simulation  of  fluid  flows.  However,  one  major  limitation 
has  been  the  size  and  speed  of  the  available  computers.  Now  the  new 
generations  of  computers,  including  the  ILLIAC  IV,  the  Cray-1,  and  the 
Star-100,  use  vector  architectures  and  thus  offer  new  possibilities  in 
the  area  of  fluid  flow  simulation. 

The  purpose  of  this  project  was  to  implement  the  Beam-Warming 
scheme  for  solving  the  Navier-Stokes  equations  [1]  on  the  Cray-1.  More 
precisely,  the  plan  was  to  combine  the  best  parts  of  the  codes  written 
by  Steger  and  Pulliam  [2]  for  the  NASA-Ames  CDC  7600  and  by  Lomax  and 
Pulliam  [3]  for  the  NASA-Ames  ILLIAC  IV,  so  as  to  take  maximum  advantage 
of  the  Cray-1.  Our  task  was  complicated  by  the  fact  that  the  major 
loops  in  the  two  codes  were  in  opposite  orders.  However,  we  were  able 
to  capture  all  of  the  important  vector  operations  of  the  ILLIAC  code 
and  to  a  large  extent  accomplish  our  goal. 

2.  Numerical  Scheme 

The  procedure  coded  is  an  implicit  finite-difference  method  for 
simulating  unsteady,  three-dimensional  flow  of  a  compressible,  viscous 
fluid  in  arbitrary  geometry.  The  scheme  assumes  that  the  geometry  of 
the  physical  problem  given  in  x-y-z  space  has  been  mapped  onto  a 
rectangle  in  the  C-u-C  space  and  that  the  metrics  £y,  £_,  n*.  nw. 

V  cx>  and  r,z  and  the  Jacobian  J  of  the  mapping  are  provided. 

(For  work  on  the  mapping  see  [4]  or  [5].)  Also,  since  we  are  interested 
in  high  Reynolds  numbers  flows,  we  use  a  thin  layer  approximation  near 
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the  rigid  boundaries.  (For  a  discussion  of  the  turbulence  model  used, 
see  [6].) 

Incorporating  the  mapping  and  thin  layer  assumptions  into  the  Euler 
equations,  we  are  left  with  the  following  system  of  equations  to  solve 


(1 1  ♦  #1  -  L>  *  ■  E.)  *  £<§  -  L>  -  i  $ 
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p,  u,  v,  w  and  e  are  the  density,  velocity  components  and  internal  energy, 
respectively;  £x.  £z»  nx»  ...»  ct  are  the  metrics  associated  with 

the  mapping;  p,  ic,  Pr,  Re  are  viscosity,  coefficient  of  thermal  conductivity. 


Prandtl  number  and  Reynolds  number,  respectively;  and  E  ,  F  and  G  are  the 
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functions  E,  F  and  G  evaluated  at  free  stream  values  of  q  (this  will 
ensure  that  our  resulting  finite  difference  equations  will  maintain  a 
free  stream  flow). 

Equation  (1)  is  solved  by  the  Beam-Warming  scheme  [1]  or  more 
specifically  the  Steger-Pull iam  implementation  of  the  Beam-Warming 
scheme  [2].  The  technique  uses  trapezoidal  time  differencing,  expansion 
of  the  nonlinear  terms  about  the  nt*1  time  step,  and  approximately 
factoring  the  resulting  equation.  We  then  obtain  the  following  finite 
difference  equation 

(2)  (I^ScAn-eIJ'lVcJ)(I+^SnBn-eIVTiAriJ-^in)(I+MK5cCn-eiVcAi;j)Aq 

=  At(ScEn+6n  Fn+6?Gn-^SnSn)-£|J-l[(VcA?)2+(VnAnf+(VcAc)23Jqn 

where  At  is  the  time  increment;  6^,  6^,  6^  are  central  differences  with 
respect  to  the  appropriate  space  variable;  ej  and  e£  are  amounts  of 
dissipation  added  (second  order  and  fourth  order,  respectively);  all 

functions  with  a  superscript  m  denote  that  function  evaluated  at  time  mAt; 

n+1  n  3E  D  3F  3G  3S 

Ag  =  g  -  g  ;  and  A  =  B  =  C  =  ~  and  M  =  -j— .  For  an  excellent 

paper  describing  the  above  scheme  see  [2]. 

The  scheme  coded  is  to  solve  equation  (2).  This  process  consists  of 

the  following  four  steps:  (1)  the  explicit  computation  of  the  right-hand 

side;  (2)-(4)  the  solution  of  three  block  tridiagonal  systems  of  equations. 

3.  Cray-1  Code 

The  7600  code  to  solve  equation  (2)  is  relatively  straightforward 
(or  at  least  as  straightforward  as  such  a  large  code  can  be).  The 
right-hand  side  is  evaluated  and  then  the  three  block  tridiagonals  are 


solved.  An  important  fact  is  that  the  equation  to  be  solved  is  indexed 
in  the  direction  associated  with  that  particular  factor.  For  example, 
when  solving  the  equation 

(3)  (I  +  4p$r)Bn  -  ejV^J  -  ^~Mn)u  =  Right-hand  side, 

it  is  logical  and  saves  storage  space  to  fix  the  J  and  L  indices  (indices 
associated  with  the  £  and  ?  directions,  respectively)  and  solve  the 
resulting  system  of  equations  indexed  by  K  (index  in  the  n  direction). 

This  allows  the  7600  code  to  solve  a  30-row  (because  of  size  limitations 
30  is  the  largest  number  of  mesh  points  used  in  any  direction)  block 
(5x5  blocks)  tridiagonal. 

Unfortunately,  although  the  above  procedure  is  probably  vectorizable, 
it  is  nowhere  near  optimum  for  the  Cray-1.  The  largest  resulting  vector 
would  be  30  where  multiples  of  64  are  the  optimum  vector  sizes  on  the 
Cray-1 . 

Because  of  the  vector  capabilities  of  the  ILLIAC,  the  ILLIAC  code 
is  quite  different  from  the  7600  code.  The  ILLIAC  code  is  written  in 
CFD,  [7],  which  is  similar  to  FORTRAN  so  parts  were  able  to  be  used 
almost  exactly  as  in  the  ILLIAC  code.  However,  due  to  peculiarities  of 
the  ILLIAC,  much  of  the  code  is  ILLIAC  dependent  (disc  manipulations, 
manipulation  of  scalars,  scalar  arithmetic,  etc.).  Hence,  we  developed 
the  Cray  code  by  using  the  serial  operation  flow  of  the  7600  while 
retaining  all  of  the  vector  portions  of  the  ILLIAC  code  that  are 
performed  often.  For  another  Cray-1  implementation  of  the  ILLIAC  code 
and  the  subsequent  comparison,  see  [8]. 

Much  of  the  logic  In  the  ILLIAC  code  Is  due  to  the  data  structure 
and  manipulation.  One  problem  is  that  the  ILLIAC  has  only  130K  words 
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of  memory  so  to  be  able  to  work  a  very  large  problem  it  is  necessary  to 
use  the  disc  extensively.  Also,  two  characteristics  of  the  ILLIAC  are 
that  it  will  only  handle  vectors  of  length  64  or  less  (and  it  is 
inefficient  to  use  less)  and  it  is  very  difficult  (nearly  impossible)  to 
transfer  any  information  down  the  vector.  Because  of  these  limitations 
the  data  structure  used  in  the  ILLIAC  code  is  to  partition  the  grid 
into  8x8x8  blocks.  A  row  of  these  blocks  in  a  given  direction  is 
then  called  a  pencil  in  that  direction.  Because  of  the  space  limitation 
of  the  memory,  the  scheme  is  to  bring  a  pencil  at  a  time  into  core.  The 
solution  scheme  involves  first  sweeping  through  each  of  the  x-pencils 
and  withthedata  in  the  machine  where  the  n-C  directions  are  the 
components  of  the  vector  (an  ^-pencil  will  contain  JMAX  8x8  n-£  planes), 
calculate  the  necessary  £  differences  in  the  right-hand  side  of  equation 
(2).  The  next  two  steps  involve  doing  the  same  thing  with  the  n  and  £ 
pencils.  These  steps  are  necessary  with  the  ILLIAC  since  the  machine 
will  not  allow  arithmetic  along  a  vector.  This  implementation  is  convenient 
since  it  makes  the  right-hand  side  calculations  vector  operations.  The 
next  three  steps  involve  sweeping  through  the  pencils  in  each  of  the 
directions  again,  at  each  step  solving  the  appropriate  block  tridiagonal 
matrix.  For  example,  equation  (3)  would  be  solved  while  sweeping  the 
n-pencils.  Since  the  left-hand  side  of  equation  (3)  Involves  only  n 
differences,  the  vector  contains  an  8  x  8  piece  of  a  £-c  plane.  This 
makes  solving  equation  (3)  a  perfect  vector  operation.  We  might  add  that 
for  each  of  these  sweeps  the  data  must  be  In  the  machine  in  a  different 
order  (to  make  the  operations  vector  and  to  make  the  matrices  block 
tridiagonals).  Hence,  some  of  the  sweeps  also  Include  the  appropriate 
transposes  to  get  the  data  in  the  right  places.  We  should  also  add  that 
by  doing  the  right-hand  side  calculation  and  solving  the  appropriate  block 
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tridiagonal  matrix  during  the  same  sweep  for  the  C  and  n  directions,  it  is 
necessary  to  sweep  the  data  four  times  rather  than  the  six  indicated  above. 

One  convenient  aspect  about  converting  the  ILLIAC  code  to  the  Cray 
was  that  the  above  described  data  structure  is  also  best  to  use  on  the 
Cray.  Initially  it  might  seem  that  since  the  Cray  vector  operations  have 
no  length  restrictions  (except  that  the  speed  is  optimum  when  the  vector 
length  is  a  multiple  of  64),  the  best  approach  is  to  use  pencils  that 
include  the  entire  grid  or  at  least  bigger  than  8x8  pieces.  This  might 
be  possible  for  problems  if  a  disk  version  of  the  Cray  code  was  desired. 
However,  the  above  scheme  for  the  full  grid  would  require  at  least 
40  x  (grid  size)  +  50  x  (largest  plane  size)  words  of  storage  so  only 
relatively  small  problems  would  fit  on  the  Cray.  (If  the  full  block 
tridiagonal  matrix  is  formed,  the  32  would  have  to  be  increased  to  90.) 

Our  Cray  code  is  written  so  that  it  is  contained  entirely  in  core. 

In  general,  the  block  tridiagonal  solver  will  require 
3  x  (vector  length)  x  25  x  (maximum  pencil  length  -  2).  Our  block 
tridiagonal  solver  generates  the  matrix  during  the  forward  sweep  so  if  8  x  8 
pencils  are  used,  storage  due  to  the  matrix  solver  is  minimal 
(64  x  25  x  (pencil  length  +2)). 

Thus  we  have  used  the  same  data  structure  for  our  Cray  code  as  was 
used  in  the  ILLIAC  code.  The  storage  requirements  for  our  code  are  approxi¬ 
mately  15  x  (grid  size)  +  64  x  43  x  (maximum  pencil  length)  +  64  x  90  so 
at  least  moderately  large  problems  can  be  worked  using  the  code. 

Hence,  the  code  proceeds  to  solver  equation  (2)  by  sweeping  the 
data  to  form  the  right-hand  side  and  again  sweep  the  data  to  solve  the 
matrix  equations,  by  the  same  four  sweeps  used  in  the  ILLIAC  code. 
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4.  Summary 

Unfortunately,  at  the  time  that  it  was  necessary  to  write  this 
report,  we  did  not  have  any  comparisons  to  present.  We  are  presently 
trying  to  ready  the  code  to  reproduce  a  run  common  to  the  7600  and  ILLIAC 

codes.  This  is  not  the  trivial  matter  that  it  should  be  since  the  run  used 

a  30  x  30  x  21  point  mesh  and  the  Cray  code  was  prepared  to  accept  only 

multiples  of  8.  After  the  code  checks  out,  we  will  proceed  to  run  some 

comparisons.  We  will  also  compare  running  the  code  with  our  block 
tridiagonal  solver  and  using  a  canned  block  diagonal  solver  (though  the 
latter  will  require  storage  of  the  full  block  tridiagonal  matrix). 
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